Some Distributional Properties of Mandarin Chinese: A Study Based on the Academia Sinica Corpus
Abstract
The study of word frequency has been discussed by linguists, psychologists, and computer scientists. However, the results of these studies cannot be valid unless the corpus is big enough and properly segmented. This paper examines the distributional information derived from word frequency in a 14-million-character corpus of Chinese newspaper text (CKIP 1993). This is the first available Mandarin Chinese corpus of such magnitude. The word frequency count is obtained with an automatic segmentation program with an accuracy rate above 99% (Chen and Liu 1992). The count reflects some general phenomena of Chinese usage. For example, among the first thousand high-frequency words, there are more bi-syllabic words than mono-syllabic words, attesting to the trend of bi-syllabification observed by many linguists. However, in general, the mono-syllabic function words occur more frequently than bi-syllabic words. In addition, the frequency of numerals is ranked according to their numeric order ('one' is more frequent than 'two', and 'two' is in turn more frequent than 'three', etc.). This paper discusses the theoretical and applied implications of these distributional properties. For instance, we find that the most frequent 2452 characters and 28124 words make up 99% of the corpus content. It is suggested that the optimal strategy for learning Chinese lies in the mastery of the most frequent 2452 characters plus words whose meanings cannot be predicted on the basis of their component characters. This implies that one need not know 28124 words in order to achieve good reading knowledge in Chinese. Given the noted parallel between the internal structure of words and phrases, one can predict that knowledge of a few thousand words and of the morphosyntactic rules will enable one to read Chinese without much difficulty.
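The 99% coverage figure in the abstract is a cumulative-frequency statistic: rank the types (characters or words) by frequency and count how many are needed before their occurrences account for the target share of the corpus. The following is a minimal sketch of that calculation, not the authors' procedure. The function name `types_needed_for_coverage` and the toy token list are assumptions made for illustration; the study itself used the CKIP corpus and an automatic segmentation program.

```python
from collections import Counter


def types_needed_for_coverage(tokens, target=0.99):
    """Return how many of the most frequent types are needed so that
    their occurrences make up at least `target` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # most_common() yields (type, frequency) pairs in descending frequency.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)


# Hypothetical toy data: 100 tokens over 5 types (not taken from the corpus).
toy_tokens = ["的"] * 50 + ["是"] * 30 + ["在"] * 15 + ["了"] * 4 + ["語料"] * 1

needed_for_90 = types_needed_for_coverage(toy_tokens, target=0.9)
```

Applied once to a character-level token list and once to a segmented word-level list, the same function would give the two kinds of counts the abstract reports (characters and words), assuming the corpus has already been segmented.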
Similar resources
Deriving Conceptual Structures from Sense: A Study of Near Synonymous Sensation Verbs
In Mandarin Chinese, lexical semantic relation of near synonyms is a widespread phenomenon, and is of great interest to many linguists. Most works deal with lexical semantic relation between lexical entries. This paper investigates the differences between Chinese near synonymous sensation verbs based on the data from “Academia Sinica Balanced Corpus of Modern Mandarin Chinese” (Sinica Corpus) a...
The expression of stance in Mandarin Chinese: A corpus-based study of stance adverbs
Stance-taking is considered as one of the fundamental properties of human communication (Jaffe, 2009). It is pervasive, intersubjective, and collaborative. While a good deal of research has investigated the expression of stance in English, much less has been done in Chinese. In this study, we draw upon the five-million-word Academia Sinica Balanced Corpus of Modern Chinese to investigate a comp...
Important and new features with analysis for disfluency interruption point (IP) detection in spontaneous Mandarin speech
This paper presents a whole set of new features, some duration-related and some pitch-related, to be used in disfluency interruption point (IP) detection for spontaneous Mandarin speech, considering the special linguistic characteristics of Mandarin Chinese. Decision tree is incorporated into the maximum entropy model to perform the IP detection. By examining performance degradation when each s...
Exploring Chinese Verbal Lexicon Developmental Trend with Semantic Space
This study assesses the influence of semantic space on the acquisition of verbal lexicon. The studied one hundred and fifty action verbs extracted from the experimental data in M3 project are classified into clusters in terms of meaning specificity. The semantic space variation of different clusters is examined in the distributional model based on Academia Sinica Balanced Corpus (ASBC) with Lat...
Quality Assurance of Automatic Annotation of Very Large Corpora: a Study based on heterogeneous Tagging System
We propose a set of heuristics for improving annotation quality of very large corpora efficiently. The Xinhua News portion of the Chinese Gigaword Corpus was tagged independently with both the Peking University ICL tagset and the Academia Sinica CKIP tagset. The corpus-based POS tags mapping will serve as the basis of the possible contrast in grammatical systems between PRC and Taiwan. And it c...
Publication date: 2006